*This article is provided courtesy of the Minnesota Aquatic Invasive Species Research Center at the University of Minnesota, where the author is the staff Quantitative Ecologist*

Intro

Recently, it came to my attention that some of the researchers in my Center were not all that familiar with the concept of p value corrections. I was initially surprised, but in retrospect, I probably shouldn’t have been! I think my first introduction to the idea actually came from a reviewer on one of my manuscripts, not from one of my classes. Even when I taught Biometry here at UMN a few years back, I probably only spent a slide or two on the idea. That suggests to me that our typical collective “training” as users of statistics has a sizable hole through which an important idea like this can slip!

It's time we stopped pretending that understanding __*p* values__ deeply isn't critical.


So, I thought this would make a great topic for a blog post, in no small part because it gives me another soapbox upon which to recap some important statistical concepts! In this post, we’ll tackle what p value correction is, why some argue it must be done (and why others argue it isn’t actually necessary at all), and, perhaps most importantly, how we can do it using R. Along the way, we’ll also review what a p value even is and where one comes from as well as how a p value relates to the other critical concepts of the null hypothesis and of alpha.

A Motivating Example

I think this topic will be much more interesting for all of us if we frame it within the context of a problem we can all relate to. Since MAISRC, the research center I belong to, focuses on aquatic invasive species research, we’ll create an example inspired by that realm.

So, here goes: Imagine that you know of 18 lakes that have become infested with a new invasive aquatic plant called Cyclopean Eelwort (because that sounds appropriately sinister). You’ve developed a new herbicide that you hope will eradicate Cyclopean Eelwort without substantially harming any native aquatic plant species in the lakes where you use it.

Let's all meet Cyclopean Eelwort, compliments of Dall-E.


So, you set up an experimental field trial in which you treat one small, infested bay in every lake with your herbicide. You measure the standing stock of Cyclopean Eelwort as well as the standing stocks of 36 native plant taxa before and after the herbicide treatment. You are hoping to observe that Cyclopean Eelwort is significantly worse off (or at least no better off!) after versus before the treatment but that there are no significant differences (at least, not negative ones!) before versus after the treatment for any of the native species.

So far, so “ecology study,” I hope you’re thinking! Well, here’s where I will introduce a sneaky twist: Unbeknownst to you, I came into your lab in the dead of night right before you did your herbicide applications and replaced all your herbicide with Glacier Frost Gatorade. And you didn’t notice.

After all, both chemicals are a menacing color of blue and smell a bit like corroded batteries--it's honestly hard to tell the difference!


But there is a key difference between the two chemicals: In this particular case, at least, we expect Glacier Frost Gatorade to be (mostly) harmless to all aquatic plants, once it has been sufficiently diluted by your herbicide applicators. As such, at least I will know, going into your study, that all the null hypotheses will be true: There really will be no difference in the standing stocks of any plant species before versus after your treatments. Any differences we do observe will simply be the products of random chance alone!

Sorry, by the way, that I’m sabotaging your project and wasting a year of your life in the process, but it’s really a small price to pay to teach important statistics concepts, right? Right. We cool? Good.

Um...no, actually, I'm not all that amused, Alex--this all had *better* be worth it!


With that out of the way, let’s randomly draw some hypothetical data from your study so we have some nice, pretty numbers to ogle. For each lake, we need observed before- and after-treatment health data for 37 species, which we’ll represent using random three-letter codes (CEW being the code for our invasive, Cyclopean Eelwort). We’ll measure “health” in terms of density (stems/meter) so that it’s a nice, continuous measure that’s easy to wrap our heads around, even though the biology of that doesn’t otherwise make a ton of sense here.

Because of my betrayal, we know that only random chance will dictate how different your “before” measurements are from your “after” measurements for the same taxon at the same lake (or how different the “before” measurements will be for the same taxon at two different lakes, for that matter). We’ll also assume that, since you are measuring density destructively (you have to chop down the plants to ID and quantify them), you won’t be measuring the exact same spot before vs after treatment, so there’s no chance your “before” measurement will directly affect your “after” measurement.

Ah, nothing feels quite so good as gathering some high-quality data with which we can protect nature!


As such, we can randomly generate your data in R by giving each species its own mean and variance parameters for density at random, then drawing “observed” density values for every lake for both the “before” and “after” treatments from those distributions we just created.

In case you’re keeping track, we have 18 lakes x 2 treatment conditions (before and after) x 37 taxa, and 18 x 2 x 37 = 1332, so that’s how many rows our data set will have when all is said and done.

library(truncnorm) #We'll take advantage of the rtruncnorm() function in this package to draw from a "truncated normal distribution," which is just a typical bell (normal) curve, but truncated (cut off) such that we won't ever get observed densities less than 0, since those don't make sense.
library(tidyverse) #Access all the goodies the tidyverse has to offer, although it's just dplyr and ggplot2 we'll need here.

#First, name our 18 lakes something fun. This was an interesting exercise in free association for me! Feel free to diagnose my current mental state using them. 
lakes = c("Blue", "High", "Seven Mile", "Disciple", "Mangosteen", "Fairview", "Mountain",
          "Dynamo", "Steel Buttress", "Envelope", "Ball Mill", "Entropy", "Fad", "Column",
          "Canopy", "Silverado", "Reversal", "Hilo")

#Next, get our 36 native species codes by generating them at random. Then, tack on CEW, our invasive species. 
natives = rep(NA, 36) #Storage object

#For each species, paste three randomly selected capital letters together. 
for(species in 1:36) {
  natives[species] = paste0(sample(LETTERS, 3, replace = T), collapse = "")
}
all_taxa = c(natives, "CEW") #Add in our invasive taxon at the end.
all_taxa #See the codes we've made--very nice.
##  [1] "NPH" "QLX" "IRP" "AEJ" "SEC" "EUY" "ZND" "BEH" "GKG" "TIZ" "LLV" "XSI"
## [13] "HSF" "YIB" "IIS" "RXC" "QLS" "ZCR" "DNF" "CWX" "PNA" "SQZ" "RHB" "SAG"
## [25] "IKW" "CIB" "PQY" "TLS" "VAK" "AKK" "CGJ" "DSU" "JRV" "PNR" "WRF" "TNR"
## [37] "CEW"
##We'll now use a for loop to generate our data for each species. Remember: We first need random means and standard deviations for every species' density values. Then we need 36 random values from those distributions for each species, one value for each lake-treatment combo.
species_means = rep(NA, 37) #Storage objects.
species_sds = rep(NA, 37)

for(taxon in 1:37) {
  #First, get our mean and sd values for the current taxon and store them. 
  species_means[taxon] = rtruncnorm(1, mean = 5, sd = 3.5, a = 0, b = Inf)
  species_sds[taxon] = rtruncnorm(1, mean = 3.5, sd = 5, a = 0, b = Inf)
  
  #Then, get our observed abundances for every lake for this taxon. Because there really is no effect of the treatment, random chance alone will dictate how different the "before" vs. the "after" observed values will be. Same goes for values at different lakes--it's all just random!
  obs_dens = rtruncnorm(18*2, 
                        mean = species_means[taxon], 
                        sd = species_sds[taxon], 
                        a = 0, 
                        b = Inf)
  
  #Make a vector for our treatment codes.
  treatment = c(rep("Before", 18), rep("After", 18))
  
  #Build our datasheet or, once it's built, add this taxon's data to it.
  if(taxon == 1) {
    exp_data = data.frame(lake = lakes, species = all_taxa[taxon], 
                          dens = obs_dens, treatment = treatment)
  } else {
    tmp1 = data.frame(lake = lakes, species = all_taxa[taxon], 
                      dens = obs_dens, treatment = treatment)
    exp_data = rbind(exp_data, tmp1)
  }
}

#We'll sort our simulated data so that, for a given species, the before/after measurements are consecutive at the same lake because that will be easier to understand.
exp_data = exp_data %>% 
  arrange(species, lake, desc(treatment))

#Finally, let's get a taste of what we've made!
head(exp_data, n = 36)
##              lake species     dens treatment
## 1       Ball Mill     AEJ 4.076381    Before
## 2       Ball Mill     AEJ 6.653119     After
## 3            Blue     AEJ 5.168470    Before
## 4            Blue     AEJ 3.681656     After
## 5          Canopy     AEJ 4.332265    Before
## 6          Canopy     AEJ 8.134769     After
## 7          Column     AEJ 5.746067    Before
## 8          Column     AEJ 6.384047     After
## 9        Disciple     AEJ 4.446958    Before
## 10       Disciple     AEJ 5.367966     After
## 11         Dynamo     AEJ 6.307366    Before
## 12         Dynamo     AEJ 5.809779     After
## 13        Entropy     AEJ 6.487863    Before
## 14        Entropy     AEJ 4.656455     After
## 15       Envelope     AEJ 4.266664    Before
## 16       Envelope     AEJ 6.061860     After
## 17            Fad     AEJ 6.306911    Before
## 18            Fad     AEJ 7.855437     After
## 19       Fairview     AEJ 5.894929    Before
## 20       Fairview     AEJ 6.243810     After
## 21           High     AEJ 5.875680    Before
## 22           High     AEJ 5.163347     After
## 23           Hilo     AEJ 5.453344    Before
## 24           Hilo     AEJ 5.657601     After
## 25     Mangosteen     AEJ 5.630360    Before
## 26     Mangosteen     AEJ 3.112271     After
## 27       Mountain     AEJ 1.420229    Before
## 28       Mountain     AEJ 5.808488     After
## 29       Reversal     AEJ 6.276856    Before
## 30       Reversal     AEJ 6.643630     After
## 31     Seven Mile     AEJ 3.246711    Before
## 32     Seven Mile     AEJ 4.153874     After
## 33      Silverado     AEJ 2.922489    Before
## 34      Silverado     AEJ 7.375423     After
## 35 Steel Buttress     AEJ 5.031302    Before
## 36 Steel Buttress     AEJ 5.680889     After

Conducting Our Analyses

Now that we have our (hypothetical) data, we can proceed to the fun part (at least, it’s the fun part for wackos like me!)–getting statistical answers to our burning question: For each species, did our herbicide treatment have any effect?

Setting aside whether this is the “right” way to analyze these data, suppose that you ran a standard two-sample t-test on the data collected on each species, comparing the mean densities before versus after treatment. This is 37 separate tests, so you will get 37 separate p values. Let’s see those!

#We'll use the lm() function to do these t-tests, but there's also a t.test() function in R...
p_vals = rep(NA, 37) #Storage object

for(species in 1:37) {
  #First, cut just the data from this taxon out of our larger data set.
  current_species = all_taxa[species]
  sub_dat = dplyr::filter(exp_data, species == current_species)
  
  #Then, run the t test. 
  mod = lm(dens ~ treatment, data = sub_dat) 
  
  #Lastly, snag and store the p value from within the summary metadata
  p_vals[species] = summary(mod)$coefficients[2,4] 
}

#Here are the p-values we get back, sorted for your viewing pleasure:
round(sort(p_vals, decreasing = F), 4)
##  [1] 0.0099 0.0160 0.0529 0.0651 0.0674 0.0813 0.1072 0.1120 0.1385 0.1854
## [11] 0.2052 0.2287 0.3143 0.3274 0.3371 0.3399 0.3750 0.4365 0.4811 0.5060
## [21] 0.5488 0.5924 0.6093 0.6269 0.7240 0.7603 0.7636 0.8129 0.8192 0.8481
## [31] 0.8641 0.9228 0.9253 0.9323 0.9444 0.9513 0.9827

Considering (Our) P Values

Since this is a critical statistical concept to understand–one that many, even those who have taken several statistics classes, can struggle with–let’s pause at this point to refresh our memories as to what exactly a p value is.

Understanding a p value starts with understanding a null hypothesis. While this actually needn’t be the case, in most instances, a null hypothesis is the statistical equivalent of a “nothing burger.” It’s generally whatever you expect to be true if nothing cool at all is going on in your system–whatever cool-looking stuff does happen is just the result of random chance (or, in some cases, just the processes we already know are occurring). If all our null hypotheses were true, the world would probably be a very dull place!

Oh, don't be so *dramatic!* Don't worry, we'll bone up on the subject together!


In our experiment, then, our null hypothesis would be that our “herbicide” (here, really just Gatorade) will have no effect on plant density, and thus the mean plant density for Taxon X before our treatment will be the same as after our treatment. Our alternative hypothesis, meanwhile, is the exact opposite of that: Our treatment will have some effect on the density of Taxon X such that mean densities for Taxon X will be different pre- vs. post-treatment.

You’ll notice that the null and alternative hypotheses are phrased to be mutually exclusive and exhaustive. That is, they are phrased such that exactly one of them must be true and the other must be false. This is key! By framing things this way, we create a scenario in which asking “how likely is it that the null hypothesis is wrong?” is basically the same question as asking “how likely is it that the alternative hypothesis is right?”

Cats: A living refutation of the logic of mutual exclusivity.


Why is that important? As it turns out, the former question is much easier to answer from a statistical standpoint! This is the case at least in part because we can guess more or less exactly what to expect to observe if the null hypothesis is true–that is, the results that we’d be likely to record, if we collected a lot of data and the null hypothesis were true, would be highly predictable.

So, we can ask: “How likely would it be that we’d observe data that are like ours if we assume that the null hypothesis is actually true?” If our data look “highly predictable,” which in this particular case would mean that the mean densities we observe before vs. after our treatment for a given taxon are very similar, then the answer to the question above is going to be “pretty likely.”

If, by contrast, we get really different mean densities pre- versus post-treatment, the answer is instead going to be “pretty unlikely.” Specifically, what we get back from conducting a statistical test like this is a “probability value,” or p value for short. This is the probability of observing data as “extreme” as or more “extreme” than the data you actually got, assuming the null hypothesis is true.

Since this is such a key idea to wrap our heads around, let’s really chew on it for a hot moment. Let’s take our first species, NPH, and simulate the process of drawing 1,000 “before” and 1,000 “after” density values for it, based on the mean and standard deviation parameters we randomly assigned it. Then, let’s subtract those two values (by doing after-before) to get differences and plot those differences as a histogram to see their distribution. As we do all this, let’s focus on the fact that we’re keeping the null hypothesis true. That is, there really is no systematic difference between before and after, so we’re really just drawing marbles from the same jar in the same way twice, so to speak!

#Remember--We're just doing the same draw twice!
befores = rtruncnorm(1000, mean = species_means[1], sd = species_sds[1], 0, Inf)
afters = rtruncnorm(1000, mean = species_means[1], sd = species_sds[1], 0, Inf)

#Calculate the differences
diffs = afters-befores

#Visualize those differences, using ggplot to make things look a little nicer (no need to understand what most of this code is doing!)
ggplot(data = data.frame(diffs = diffs), aes(x = diffs)) +
  geom_histogram(fill = "#7FFFD4", color = "black",
                 aes(y = after_stat(density))) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(diffs), sd = sd(diffs)),
                linetype = "dotted", color = "gray15") +
  A.theme +
  scale_x_continuous("Differences (afters-befores)") +
  scale_y_continuous("Density\n")

Now, as we consider the histogram above, we have to remember that the true mean density for this species, no matter when or where we go looking, will be the value we randomly assigned it, which happens to be 9.002. That means the true mean difference in densities after vs. before our treatment is 0–that is, there is no difference because the before and after means are the same number, and subtracting X from X is always going to be 0!

And yet, 0 is not always the difference we actually observed above, is it? Sure, it was the most common difference we observed, out of our 1,000 pairs of simulated observations; the peak of our histogram above is at 0. However, at least 50 or so of our 1,000 observed differences were nowhere near 0, instead being larger than 10 in absolute value. That’s the randomness of the universe for you!

If pairs of observations can be noisy and wacky, it stands to reason that data sets can be too. Let’s see that–let’s draw samples of size 36 (18 befores and 18 afters) for this same species at random and then subtract the means of the two groups to get the mean differences in density. Let’s do this 1,000 times, which would be like simulating doing our experiment for just this species 1,000 times over.

all_diffs = rep(NA, 1000) #Storage object

for(exp in 1:1000) {
  #Basically the same code as above.
  befores = rtruncnorm(18, mean = species_means[1], sd = species_sds[1], 0, Inf)
  afters = rtruncnorm(18, mean = species_means[1], sd = species_sds[1], 0, Inf)
  
  #Calculate the differences of the mean densities of the two groups this time
  all_diffs[exp] = mean(afters)-mean(befores)
}

#See the distribution of differences
ggplot(data = data.frame(all_diffs = all_diffs), aes(x = all_diffs)) +
  geom_histogram(fill = "#7FFFD4", color = "black",
                 aes(y = after_stat(density))) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(all_diffs), sd = sd(all_diffs)),
                linetype = "dotted", color = "gray15") +
  A.theme +
  scale_x_continuous("Group differences (afters-befores)") +
  scale_y_continuous("Density\n")

As you can see in the histogram above, while, in general, we’d expect to observe a mean difference of 0 or close to it much of the time in an experiment like ours for this species, we also see that we shouldn’t expect to always see such a small difference! In fact, about a third of the time, we’d expect to see a difference as large as 1 or more, either positive or negative. And about 5-10% of the time, the difference will be 2 or more, and differences of 4 or more even happened sometimes! That’s the two groups being in the ballpark of 50% different from each other in their means even though that’s all just a product of random chance!

This exercise should hopefully reinforce a foundational concept in statistics: The world is a random place, and, as such, even when the null hypothesis is true, our data may not always look favorable for the null hypothesis. Said a little differently, it’s still possible–just unlikely–to observe an “extreme” result even when the null hypothesis is true. So, even when data look favorable for the alternative hypothesis, that doesn’t mean we should necessarily dispense with the null hypothesis!

Just because something is rare doesn't mean we shouldn't still expect it to happen sometimes...


This exercise also hopefully gives you a sense of what I meant earlier when I said that we can often imagine what we’d expect to observe in a world where the null hypothesis is true. In essence, that’s exactly what we just did for species NPH. With a guess as to the mean and variance of the data we’ll observe for this species, we were able to simulate what collecting our data 1,000 times might look like. We can use that information to answer the important question we posed earlier: “How likely is it that the null hypothesis is wrong?”

In fact, that’s exactly what our p value is–the probabilistic answer to that question. For this species, in our actual experiment, the p value we received was 0.337, based on an observed mean difference between the “before” and “after” groups of 1.333. Essentially, that p value is saying “I can imagine, based on the variability in the data you’ve provided, that if I did the exact same data collection process you just did but instead bajillions of times and the null hypothesis were true, I would get a difference this big or bigger about 34% of the time.”

If we stare for long enough at the histogram above and consider where a value of 1.333 falls within it, that percentage hopefully starts to feel like a pretty good guess! 1.333 is about halfway between the center of the graph and the right edge, so I, at least, could believe that about a third of the 1,000 values in that histogram are ≥ |1.33|.
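If squinting at histograms isn’t your thing, we can also just ask R to count for us. Here’s a sketch that re-runs the experiment 10,000 times under the null hypothesis and tallies how often we’d see a mean difference at least as extreme as ours. Two assumptions to flag: I’m plugging in a standard deviation of 4 for NPH purely for illustration (the real one was drawn at random), and I’m using a plain normal rather than the truncated one, since with a mean of ~9 and an sd of 4, draws below 0 are rare enough not to matter. So the answer will only be in the ballpark of the 0.337 we got from lm():

```r
set.seed(123) #For reproducibility

#Stand-in parameters for species NPH. The mean matches the 9.002 noted in
#the text; the sd of 4 is my assumption, chosen to be roughly consistent
#with the p value we received.
nph_mean = 9.002
nph_sd = 4

sim_diffs = rep(NA, 10000) #Storage object

for(experiment in 1:10000) {
  befores = rnorm(18, mean = nph_mean, sd = nph_sd)
  afters = rnorm(18, mean = nph_mean, sd = nph_sd)
  sim_diffs[experiment] = mean(afters) - mean(befores)
}

#What fraction of these null-world experiments produced a mean difference at
#least as extreme as the 1.333 we observed? That's a simulation-based,
#two-sided p value.
mean(abs(sim_diffs) >= 1.333)
```

Run that, and you should get something in the neighborhood of one-third, agreeing nicely with the eyeball estimate.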

A nice recap of what I'm saying here.


So, while the p value the lm() function was giving us was derived using a mathematical approach whereas our histogram was generated via simulation, they’re actually telling us the same thing: A result such as ours, while not perhaps super likely if the null hypothesis were true, is also not particularly unlikely either. We should expect such a result (or one even more questionable-looking for the null hypothesis) 1 out of every 3 times–or, to put a finer point on it, once every three studies like ours.

And this leads us to the all-important question, one that is equal parts practical and philosophical: If random stuff happens all the time, and I can sometimes get kooky looking data even when the null hypothesis is actually true, just how quirky do my data need to be before I can feel good about ruling out the null hypothesis in favor of the alternative hypothesis?

A nice sentiment amidst a topic all about rejection...


And there is a conventional answer to this question, and we’ll get to it! But first, just think about it abstractly for a moment. If you had an important decision to make–maybe you’re considering whether you should buy a house you’ve seen–what level of certainty would you want that you were making the right decision? Would you accept a coin flip (a 50% chance) that your decision is good? What about a dice roll, needing to avoid just one number (that’s about an 83% chance)? 9 in 10? 99.9%? Would your threshold change if the decision were less momentous, like about which coffee to buy at the store?

My hunch is that, yes, it would change–you’d probably demand greater certainty that your decision was a favorable one as the decision gets more important. Well, in science, we have collectively decided that we want relatively high certainty that our decisions are good. This is especially true when we’re deciding whether to declare that something interesting or important is happening, and that we have gained new insights about the workings of the universe as a result!

As such, we’ve decided to generally use a threshold (formally called our alpha) of 0.05 or 5% as this dividing line. If the results we get are so unusual that, if the null hypothesis were really true, we’d only expect to get results like them or more extreme than them 5% of the time or less, those are odds we’re cool with! That is, receiving a p value of 0.05 or less is grounds to reject the null hypothesis as being relatively unlikely, statistically. It’s more likely, we’d be saying, that the null hypothesis is actually false than that it was true but yielded unflattering-looking results like ours by chance alone anyway.

You're getting carried away there! Let me bring you back down to reality...


Let me be abundantly clear: It’s still absolutely possible, even when we get a p value < 0.05, that the null hypothesis really is still true! A p value of, say, 0.01 isn’t saying that you’d never get data that look like yours in a null-hypothesis world. In fact, it’s saying kind of the opposite: “Yeah, you can totally get data like yours when the null hypothesis is true. It’d happen 1 out of every 100 times, on average.”

Is 1 out of 100 times a lot? Well, ok, that likely depends on your perspective, I guess! But think about it this way: If you are a scientist for your entire career, and you do 100 tests in that time, at that rate, you’d reach the wrong conclusion once on average. I can’t tell you how to feel about that fact! It’s just what the numbers say, and the numbers can’t lie.

Rolling the Die

And now, we’re nearly to the heart of the matter! By this point, we’re hopefully more comfortable with what the null hypothesis is, how it relates to the p values we get back from tests, and why we need an alpha to use to decide when a p value should cause us to rule against a null hypothesis.

We’re ready, then, to think about those p values we generated earlier from our experiment. Here they are again, as a reference.

##  [1] 0.0099 0.0160 0.0529 0.0651 0.0674 0.0813 0.1072 0.1120 0.1385 0.1854
## [11] 0.2052 0.2287 0.3143 0.3274 0.3371 0.3399 0.3750 0.4365 0.4811 0.5060
## [21] 0.5488 0.5924 0.6093 0.6269 0.7240 0.7603 0.7636 0.8129 0.8192 0.8481
## [31] 0.8641 0.9228 0.9253 0.9323 0.9444 0.9513 0.9827

You’ll notice that, if we use our conventional alpha of 0.05, we’d be ruling against the null hypothesis for 2 out of our 37 species (i.e., there were two p values that were less than that cutoff). If we used a less conservative alpha of 0.1, we’d be doing so for 6 species! Even if we used a very conservative alpha of 0.01, we’d still be ruling out the null hypothesis for one species (and very nearly a second).
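To double-check those counts, we can tally the p values against each candidate alpha directly. Here, I’ve just hard-coded the 37 sorted values printed above:

```r
#The 37 p values from our experiment, copied from the sorted output above.
p_vals = c(0.0099, 0.0160, 0.0529, 0.0651, 0.0674, 0.0813, 0.1072, 0.1120,
           0.1385, 0.1854, 0.2052, 0.2287, 0.3143, 0.3274, 0.3371, 0.3399,
           0.3750, 0.4365, 0.4811, 0.5060, 0.5488, 0.5924, 0.6093, 0.6269,
           0.7240, 0.7603, 0.7636, 0.8129, 0.8192, 0.8481, 0.8641, 0.9228,
           0.9253, 0.9323, 0.9444, 0.9513, 0.9827)

#How many null hypotheses would we reject at each alpha?
sum(p_vals < 0.05) #2
sum(p_vals < 0.10) #6
sum(p_vals < 0.01) #1
```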

Why does all this matter? We know the null hypothesis is right in every case here!!! We set it up to be that way! All these significant results are the products of chance, and yet we’re about to conclude that “between 1 and 6 interesting and non-random things are happening!” And here’s the really key concept to appreciate about that: This isn’t the system breaking down; it’s the system working exactly as intended.

Why We Need to Appreciate How P values Work

Let me try to explain what I mean. Consider a p value of exactly 0.05. As we’ve seen, this p value means the result we got back from our study was relatively “extreme.” If the null hypothesis were actually true, you’d get results as extreme as this or more so only 5% of the time, on average.

By always using an alpha of 5% as our cutoff, we’d be saying “even though, 5% of the time, I should expect results this wacky by random chance alone, assuming all the null hypotheses I test against are true, I’m ok with wrongly rejecting a true null hypothesis in those cases, just so I retain the ability to reject null hypotheses that really are false.” In other words, you’re agreeing to a ~5% chance of making this particular error (rejecting a null hypothesis erroneously is called a Type I error) from the outset when you select an alpha of 5%!

Stats textbooks LOVE this analogy...come on folks, let's see some new material! I've got a *lot* of images I need to shamelessly crib for my blog posts and I like having some options...


Let’s think about all this a little differently to reinforce the idea. Imagine a world in which all the null hypotheses are true. As we’ve seen, even a null-hypothesis world can produce wacky results through chance alone sometimes. By using the 5% cutoff, we’re saying that, for every 100 tests we do, we’re ok with rejecting the null hypothesis in 5 of those (on average), even though we just acknowledged that all the null hypotheses in this world we’re imagining are true. Given enough time and enough tests, we’d amass a whole pile of rejected null hypotheses in this hypothetical universe even though every single one of them is true. Let that sink in and try not to cry!
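In fact, if we’re willing to assume our tests are independent of one another (an assumption on my part, and probably a generous one for species sharing lakes), we can put a number on how likely we are to commit at least one Type I error somewhere in a batch of 37 tests:

```r
alpha = 0.05
n_tests = 37

#P(at least one false positive) = 1 - P(zero false positives in any test).
#For 37 independent tests at alpha = 0.05, this works out to about 0.85,
#i.e., an ~85% chance of at least one spurious "discovery"!
1 - (1 - alpha)^n_tests
```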

Perhaps not, but give us scientists time and we'll be convinced we know *everything* anyway, even if there's not actually anything to know!


As something of a side note, consider that our cutoff, alpha, is about relative probabilities, not about “raw extremeness.” An alpha of 0.05 says that results “extreme” enough to occur 5% of the time or less under the null hypothesis are sufficiently “extreme” that we should reject the null hypothesis. However, it says nothing about how extreme in actual units such “extreme” results need to be, does it? It just notes that they should be “uncommon.”

This means that a profoundly trivial observed difference (like mean densities that differ by just 0.02 stems/meter) could be “significantly extreme” for one species but a profoundly large observed difference for another species in the same study could be considered “expected.” It’s all based on what the null hypothesis is thought to be capable of producing, results-wise. And, either way, 5% of the time, assuming the null hypothesis is right, we can expect to be wrong using 5% as our cutoff.
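To make that concrete, here’s a toy sketch (with made-up numbers, not our eelwort data) of two t-tests: a minuscule difference paired with minuscule noise, and a hefty difference paired with hefty noise. The “extremeness” that matters to the p value is relative to the variability:

```r
set.seed(42) #For reproducibility

#A tiny before/after difference (0.02 stems/meter) with tiny variability...
before_tiny = rnorm(18, mean = 5.00, sd = 0.01)
after_tiny = rnorm(18, mean = 5.02, sd = 0.01)

#...versus a much bigger difference (4 stems/meter) with big variability.
before_big = rnorm(18, mean = 5, sd = 10)
after_big = rnorm(18, mean = 9, sd = 10)

#The trivial difference will typically come back "significant" here, while
#the big one typically won't--it's all relative to the noise.
t.test(after_tiny, before_tiny)$p.value
t.test(after_big, before_big)$p.value
```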

I sure am!


So, we’re hopefully now ready to ask a pivotal question about our p values from our experiment: How many of our significant p values (those that were less than our alpha) are likely to be “false positives,” ones spurring us to reject the null hypothesis even if it might still be right?

In a real study, we wouldn’t know the answer to this question! If we did, we wouldn’t need to be doing the study, probably! However, here we do know the answer: 100% of these significant p values are “false positives.” We designed this exercise to only have true null hypotheses, and yet we are poised to reject 2 of them (or 1, or 6, depending on the alpha we use) anyway. We’re about to declare that our herbicide is actually harmful (or beneficial) for two taxa that it actually doesn’t affect one way or the other. Bummer. Imagine what kinds of knock-on consequences these claims could have…

And, as my points above hopefully revealed, that’s exactly what we should have expected going into our study! Consider: If we did 37 tests, each with an alpha of 0.05 and assumed that all the null hypotheses were true, we could ask: “What fraction of those 37 tests should we expect to produce a significant p value, just by chance alone?”

This is a question we can answer! Or, rather, one that we can have R answer for us. Here’s what we can do: We can have R flip a “loaded coin” 37 times, with the probability of heads (here, a “false positive”) being equal to our alpha of 0.05. We can then count up how many heads we get out of 37 flips. That will simulate how many false positives we would get in an experiment exactly like ours. We can even do this same process 1,000 times and see the whole range of false positive counts we’d get…

total_heads = rep(NA, 1000) #Storage object

#Simulate the coin flipping for 1,000 experiments like ours
for(rep in 1:1000) {
  set_of_flips = sample(
    c(0, 1), #Heads = 1s, tails = 0s
    size = 37, #One coin flip for each taxon
    replace = TRUE,
    prob = c(0.95, 0.05)) #Set the probability of heads (false positives) to 0.05

  #Add up the number of heads we see in this experiment
  total_heads[rep] = sum(set_of_flips)
}

#Visualize the results
ggplot(data = data.frame(total_heads = total_heads), aes(x = total_heads)) +
  geom_histogram(fill = "#7FFFD4", color = "black",
                 aes(y = after_stat(density))) +
  stat_function(fun = dnorm,
                args = list(mean = mean(total_heads),
                            sd = sd(total_heads)),
                linetype = "dotted", color = "gray15") +
  A.theme +
  scale_x_continuous('Number of heads ("false positives") obtained') +
  scale_y_continuous("Density\n")

What the histogram above shows us is that, when we do 37 tests with an alpha for each test of 0.05 and when all the null hypotheses are assumed to be true, we should expect to see ~2 significant p values on average (or as few as 0 and as many as 6ish). That actually should make some sense. After all, what is 5% of 37? 1.85, so basically 2.
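We don’t even need the simulation to get these numbers, actually–they fall straight out of the binomial distribution via R’s built-in `pbinom()` function. A quick sketch to confirm the figures above:

```r
#Expected number of false positives across 37 tests at alpha = 0.05
expected_fps = 37 * 0.05
expected_fps #1.85, so basically 2

#How often would we see 6 or more false positives? Rarely!
1 - pbinom(5, size = 37, prob = 0.05) #~0.007
```

This matches the histogram: about 2 on average, with 6 or more being a once-in-a-hundred-studies kind of outcome.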

This is our next key concept: The more tests we do, even when they are trying to disprove a null hypothesis that is really true, the more significant results we will inevitably find anyway. This fact is simply a consequence of using a relative probability as our cutoff–if we’re ok being wrong 5% of the time in the worst-case scenario (i.e., all the null hypotheses are true), we will indeed be wrong 5% of the time in that scenario, just as we were in our actual experiment!

Unfortunately, this means conducting bogus science is frighteningly easy–it’s a dirty little secret, but if you want to be assured of a significant p value, even when there’s nothing “significant” to be found, simply do enough tests (about 20 should do it), and you’ll be all but certain to get at least one!

You can 'prove' anything with enough dedication. It'd be an uplifting message...if it weren't so *unsettling* of a reality...

If this all makes sense so far, then you must accept that the scientific literature is unfortunately probably polluted with tons of significant p values actually produced just by chance. This wouldn’t actually be as big a problem if studies were regularly and systematically replicated to see if they yield the same results multiple times. After all, getting a significant p value once is a 0.05 probability. Getting it 5 times out of 5 is a 0.05^5 probability, i.e., waaaay smaller! Of course, the vast majority of scientific findings are not rigorously replicated, but shhh, we don’t talk about that in mixed company!

Now, of course, many of the significant but unreplicated p values in the literature are probably perfectly virtuous (especially those tested against unlikely null hypotheses), but some are undoubtedly flukes too, and it’s really hard to know which is which. (Sidenote: Another dirty little secret–if you want a significant p value, just test against a “dumb” null hypothesis, such as that sunlight has no effect on plant growth!)

Well, gee, that makes me feel a __lot__ better!

Now, at this point, you might be wondering “But Alex, if we all collectively chose 5% as the cutoff, and I use 5% too, then I have a 5% chance of polluting science with fluky p values from my study, just like everyone else. What more should I really need to do?” My answer–which you might be surprised and disheartened to hear!–would be that you do not have a 5% chance of polluting science with your study–it’s actually likely much higher than that…

Recall that we did a whopping 37 tests in our hypothetical experiment, one for each taxon. The 5% probability of producing a “false positive” (when the null hypothesis is true) is 5% per test we do. That means that the number of false positives our study will produce is a function not only of the alpha we choose for each test but also of how many tests we do.

In essence, the histogram I showed earlier–you know, this one:

is the expected range of false positives we could expect a study like ours to produce, given 37 tests and a per-test alpha of 0.05 (assuming all the null hypotheses are true). That means our study is actually poised to pollute the scientific literature with ~2 p values we happen to know will lead to erroneous conclusions here.

Here’s another, more general way to think about this problem: The probability of producing at least one false positive is equal to one minus the probability of producing no false positives (these are mutually exclusive outcomes, so they must sum to 1). The probability of not producing a false positive in a single test is 95% (100%-5%, if our alpha is 5%), which means the probability of not producing a false positive over 37 tests is 0.95^37, or 0.15. This means we have about an 85% chance of producing at least one false positive in our study (and perhaps more than 1!). That’s way higher than our alpha, right? For context, that’s like needing to roll a die and only get a 6 to succeed. Do you like those odds?
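That arithmetic is worth confirming in R–nothing fancy here, just the formula from the paragraph above:

```r
#Probability of zero false positives across 37 tests, each at alpha = 0.05
p_no_fps = (1 - 0.05)^37
round(p_no_fps, 2) #0.15

#So the probability of at least one false positive is...
round(1 - p_no_fps, 2) #0.85
</code_check>
```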

Now that I’ve maybe filled you with ennui and despair, we can finally consider the most important question: “What ever can we do about all these false positives we all seem destined to uncover???!”

Taking Corrective Actions

Here’s the first piece of great news: The actual situation on the ground is probably less dire than I just made it seem. After all, many of the null hypotheses we test against are probably false, at least to some degree. The world is full of interesting, non-random processes, for one thing! For another, most scientists choose which hypotheses to test in a highly informed manner, so we tend to go into studies with a good feeling about the null hypothesis being false.

This is a key concept: Our alpha is our probability of a “false positive” only if our null hypothesis is exactly true. The more obviously untrue the null hypothesis is, the more often a well-designed study will produce results unflattering to it, and the more often we will reject it. Whew!

Here’s the other piece of great news: There are many things we can all do to lessen the potential impacts caused by significant p values that are actually just the results of random chance, both as researchers conducting statistical tests and reporting their results and as consumers of scientific findings.

As consumers, we can:

  • Assume as a starting point, going into each study we read, that X% of the significant results reported might be “false positives,” where X is based on the study’s alpha (usually, 5%). To some extent, I do this–if a paper was expected to get 3 significant results, given its number of tests and its alpha, and it instead got 24, I might find that more compelling evidence than a paper that expected 3 significant results and actually got 4.

  • Act as though every individual significant result is a potential fluke until it is rigorously replicated. Once or twice may well just be the null-hypothesis world messing with us, but five or six times can be the start of a pattern!

  • Be extra cautious when a paper reports a whole busload of p values. Remember: The more tests that are conducted, the more that will inevitably yield significant results! In other words, you should assume all tests will err (with respect to false positives) at a rate equal to alpha. I’m especially critical when it feels like many of the tests are asking statistical questions that seem only tangentially related to the core hypotheses of the study. Beware of studies that seem like “fishing expeditions,” those conducting a large number of tests without a clear and logical justification.

  • Be skeptical when tests are vague about what null hypothesis was being tested, and especially unimpressed when it seems as though the null hypothesis being tested is particularly “dumb” (i.e., reasonably unlikely to be true from the outset). When the null hypothesis is really dumb, it’s much more likely we’ll reject it, but that doesn’t mean we’ve gained any new insights in the process!

  • Assume that publication bias is occurring all the time. That is, assume that many people are doing good science but get non-significant p values. As a result, they assume their work isn’t publishable, valid, or meritorious, and so they don’t publish it (or editors reject it). As such, it’s best to assume that many studies actually have been replicated and that, in all those instances we aren’t seeing because of publication bias, no significant results were observed.

What we don't see *can* hurt us. Sounds like a pretty good tagline for a horror flick!

Meanwhile, as researchers, we can:

  • Set clear study goals and expectations up front and conduct the minimum number of statistical tests needed to assess them. My motto is “Don’t ask statistical questions for which you don’t need or want the statistical answers.” A good rule of thumb is to imagine that you will need to write an entire Discussion paragraph for every single p value your study generates. If you have 37 tests, that’s 37 paragraphs. It’ll take you eight months to write, and absolutely no one would want to publish or read it! Be selective in your inquiries, and don’t go on unplanned fishing expeditions.

  • Choose a more conservative alpha whenever appropriate. If the thought of finding false positives (up to) 5% of the time bothers you, then choose a lower alpha! In fact, some have contended, with the proliferation of p values that has occurred in recent years, that we should all consider using alphas of 0.01 or even lower for most tests. Just because 5% is the convention doesn’t mean that it’s “right;” we can all feel free to choose our own level of risk that we’re comfortable with. And that level can shift as the stakes of our research change–maybe for exploratory work, 5% or even 10% is ok, whereas you might use 1% (or lower) for definitive tests of effectiveness.

  • Adjust or “correct” your p values upward when appropriate. More on this below!!!

  • Be explicit and transparent with our readers about what our null hypotheses are and what our alpha is. Furthermore, be circumspect, when discussing our results and acknowledge that significant but unreplicated results should be interpreted as tentative.

  • Select informed null hypotheses whenever possible. Rather than testing to see if something has zero effect when that is super unlikely based on what we already know, maybe instead test whether the effect is greater than or less than a reasonable, small, and not meaningful effect size, such as 2 (most statistical tests don’t really care what value the null hypothesis centers on, just that there is one!).

  • Be wary when conducting post-hoc tests. These are where we ask which levels of a categorical variable are significantly different from each other. If you have a variable with, say, 10 levels, and you use a Tukey’s test to compare each level to every other level, that is (10x9)/2 = 45 tests and associated p values. I shouldn’t have to tell you by now that that is a lot of tests! There’s no way that you or your readers are actually critically interested in that many comparisons.

    So, if you must perform a post-hoc test like this, do so with a small, curated list of comparisons in mind and report only those comparisons, regardless of any other significant results that may pop up.

  • Be wary when using model selection approaches. These approaches are where you may have many potential models, or many possible explanatory variables, that you could use. To figure out the “best” one, you try some or all of them and select one as “superior” based on one or more metrics. While these techniques absolutely have their uses and merits, they often favor the retention of significant variables and the exclusion of non-significant ones. In that sense, they are “cheating” as far as p values are concerned. If you really do 50 tests (one for each potential variable) and retain just the “best-looking” ones, the probability that at least one of those significant p values is actually a fluke is often much greater than 5%.

  • Consider that even trivial effects can still be “statistically significant,” yielding small p values; conversely, even seismic effects can still be “statistically non-significant.” The p values we get from a test are a function of how variable our data are, how big our sample size is, and what the null hypothesis is expected to be capable of producing, among other factors. As such, small and variable samples “empower” null hypotheses by suggesting that the probability of even an “extreme” event (in raw units) occurring is high.

    Given that fact, make sure to discuss the practical/biological significance of your results too. If a p value is significant, ask: would the raw observed difference or effect be meaningful? If a diet pill “significantly” reduced weights by an average of 0.01 lbs over the course of a year, would you pay $1,000 a month for that pill? Probably not! So, never stop at statistical significance; instead, start at statistical significance, and stop at the relevance (and logic) of the effect indicated!

While this article is ostensibly about p value corrections, the lists above hopefully demonstrate that a holistic approach to solving this problem is needed! We really ought to be vigilant in many ways to protect ourselves against “false positives.” However, I want to highlight p value correction as one potential tool in our arsenal because its logic is approachable, it’s easy to do in R, and reviewers may well suggest that it be done if you submit a paper with a lot of p values in it.

Moving (P Values) On Up

The basic idea of p value adjustment (or “correction”) is relatively straightforward. If performing multiple tests increases our chances of making at least one “false positive” (Type I) error across our entire study, then this is sort of like saying “at the level of my entire study, my effective alpha is either too high or my p values are deceptively too low.”

In other words, if the “standards” by which we draw statistical conclusions at the level of the individual test (i.e., comparing a p value to an alpha) are leading to adverse outcomes at the level of the study, it makes sense to reconsider those standards and modify them to “correct” the issue.

If there's one thing us statisticians know about, it's being right *technically*.

One common correction is known as the Bonferroni correction. This correction revolves around the idea of a family-wise error rate (FWER). You can think of the FWER as just alpha but for an entire study–it’s the probability of getting at least 1 false positive across an entire family of tests given that all the null hypotheses are true.

We saw earlier in this post that the probability of observing one or more false positives is 1 - the probability of observing none, the latter of which is calculated: (1 - alpha) ^ number of tests. So, that means that the FWER = 1 - (1 - alpha) ^ n. Now, I know, that’s a lot of “1 -”s in there. So, let’s pause to appreciate what this formula is suggesting:

This suggests that if we want to set a “family-wide” alpha (the FWER) that is less than our per-test alpha (let’s call this one alpha-t), but we’re stuck with the number of tests we have (n), then we have only one option: to shrink alpha-t until the FWER is equal to the value we want, and to do so more harshly the more tests we do.

That leads to the so-called “Bonferroni correction,” which is to divide our alpha-t by n. So, in our case, that means our effective alpha for each test should really be 0.05/37 or 0.0014.
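To see what this division buys us, here’s a small helper (`fwer()` is just a name I’ve made up for this sketch) that applies the FWER formula from above:

```r
#Family-wise error rate for n tests, each conducted at a given per-test alpha
fwer = function(alpha, n) { 1 - (1 - alpha)^n }

#Uncorrected, the FWER balloons as the number of tests grows
round(fwer(0.05, c(1, 5, 10, 37)), 2) #0.05 0.23 0.40 0.85

#With the Bonferroni-corrected per-test alpha, the whole family stays under 0.05
round(fwer(0.05/37, 37), 3) #0.049
```

In other words, shrinking each test’s alpha to 0.05/37 drags the study-wide false-positive risk back down to (just under) the 5% we originally thought we were signing up for.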

...I know, it sounds more like a dry pasta brand than a statistical tool...

In practice, though, this is rarely how the Bonferroni correction is used. Why? Well, many find it confusing to have alphas that are not nice, round numbers. If I see a p value of 0.00132, it’s convenient for me as the reader to not have to flip back to see exactly what the alpha was to know whether that was significant or not. We’re already stuck with p values that are rarely going to be round numbers; it’s nice for our alphas to be round numbers, at least!

This is why, instead, the Bonferroni correction is often applied to the p values we received instead of to our alphas. Here, we instead multiply each of our p values by our number of tests (n), resulting in higher values than before. Here’s what our p values from our original test would look like if we did this.

#Do the correction and put the values in order.
p_vals_BC = sort(round(p_vals*37, 3)) 

#We'll get a lot of p values that are > 1 if we do this, so we should set all those to 1 manually.
p_vals_BC[p_vals_BC > 1] = 1

p_vals_BC
##  [1] 0.366 0.591 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## [13] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## [25] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## [37] 1.000

Notice how this correction entirely changes our interpretations of these p values! Now, we’re not rejecting a single null hypothesis. In fact, we’re not even coming close to doing so.

Of course, we didn’t actually need to do this correction “by hand.” There’s an R function called p.adjust() that will do this for us, but it’s nice to know what it’s doing, right? Right.

round(sort(p.adjust(p_vals, method = "bonferroni")),3)
##  [1] 0.366 0.591 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## [13] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## [25] 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## [37] 1.000

Now, there are two things worth pointing out at this juncture. First, note that our “standard” interpretation of a p value as the probability of getting a result this extreme or more so given the null hypothesis is true does not hold for these corrected p values. This is one major downside to p value correction in general, and one reason in favor of correcting alpha instead, even though that is rarely done.

This means that, if we want our readers to be able to interpret our p values in isolation correctly, we would need to report both our corrected and uncorrected p values, the former for reaching the statistical conclusion about the null hypothesis given our alpha and the latter for getting a sense of just how unlikely our data suggest the null hypothesis actually is.

The second thing worth pointing out is that the Bonferroni correction is, by any measure, pretty harsh! Just look at all those 1s we produced when we used it. Here’s where it’s worth considering that there are four possible outcomes from any statistical test: We could either reject or fail to reject the null hypothesis, and we could also be either right or wrong to make the decision about rejecting the null hypothesis that we’ve made.

Statisticians love the word 'error', don't they...

Now, when we’re right, we’re right! Let’s set those two possibilities aside. We’re left then with the two ways we can be wrong. We’ve already seen that rejecting the null hypothesis when it’s true is a “false positive” (Type I) error. However, we can also fail to reject the null hypothesis when it’s wrong too; that’d be a “false negative” (Type II) error.

Assuming that the “wrongness” or “rightness” of a null hypothesis is held constant, the probabilities of committing our two types of errors are in conflict. The more we try to prevent one of the errors, the more likely it is that we will commit the other error instead.

This means that, by working really hard to not commit errors of “false positives,” we substantially increased our likelihood of committing errors of “false negatives” instead. Here, that isn’t the case, but only because we know here that all the null hypotheses are false! That’s a luxury we don’t have in real studies (usually).

TBH, I'd probably read it...

Perhaps you’ve heard of the concept of statistical power. Power is the probability of not committing a “false negative” (Type II) error; the Type II error rate itself is often represented as beta, making power equal to 1 – beta. Defined differently, power is the probability you will reject the null hypothesis when you ought to (a good thing!).

Assuming that we don’t know whether a null hypothesis is right or wrong, by adjusting alpha, we also adjust beta. By ratcheting alpha way down here to avoid making Type I errors, we’ve effectively ratcheted beta way up–and our power way down–because the two are essentially joined at the hip.

Luckily, in our case, beta is actually 0 in our study–all our null hypotheses are right. This might break your brain to think about, but in every single test, alpha is always 0 or beta is. After all, either the null hypothesis is right or it isn’t (more mutually exclusive outcomes!). So, here, changing our alpha is “risk-free;” it comes with no “beta penalty.”

However, imagine our actual experiment, had I not sabotaged it. Our actual experiment would have used real herbicide, which should have had a real impact on many species. For those species, the null hypothesis would actually be wrong, but we wouldn’t know that for sure. If we had used the Bonferroni correction on the p values for those species, because of how harshly it punishes our power, we’d probably have needed to observe a truly devastating effect of our herbicide to be able to reject the null hypothesis like we ought to.
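We can put rough numbers on that power penalty with base R’s `power.t.test()`. The figures below are entirely made up for illustration–imagine a two-sample t-test for one taxon with 20 samples per group, a true herbicide effect of 1 stem/meter, and a standard deviation of 1.5:

```r
#Power at the conventional per-test alpha of 0.05
power.t.test(n = 20, delta = 1, sd = 1.5, sig.level = 0.05)$power

#Power at the Bonferroni-corrected per-test alpha of 0.05/37
power.t.test(n = 20, delta = 1, sd = 1.5, sig.level = 0.05/37)$power
```

With these hypothetical numbers, power falls from roughly a coin flip to nearer one chance in ten–the same data and the same true effect, but a much harsher standard to clear.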

How come no one with access to a genie asks for unlimited *statistical* power? Now that's a movie I'd watch!

Long story short: Many people like getting significant p values (for both good and not as good reasons), so they don’t like the Bonferroni correction. In the tug of war between preventing one type of error and preventing the other, they would argue this makes it far too likely we’ll end up with “false negatives.” “Why even do the study?”, they might ask!

But, at the same time, many would argue that not correcting one’s alpha or p values when reporting on this many tests goes too far in the other direction, making it far too easy to dismiss null hypotheses in all those situations that we ought not to (and we can never quite be sure when we’ll find ourselves in those situations!).

Is there a middle ground? Well, maybe.

A Sense of (False) Discovery

There’s an alternative set of corrections that, instead of being centered on the FWER, is centered on the False Discovery Rate (FDR). These essentially say “Hey, rather than punishing our alpha or p values for every single test we do (even those that didn’t even lead to rejection of the null hypothesis in the first place) in an effort to avoid ‘false positives,’ why not punish based just on the number of positives we get? After all, those are where our Type I errors must lie.”

In other words, the FDR is the expected proportion of “false positives” out of all the “positives” (“false” plus “true”) that we get. Thus, so long as no more than 5% of our positives are expected to be false, then at least we’re not adding (many) more “false positives” to the literature than anyone else, right?

A film entirely focused on false positives? Sounds like a statistician's worst fear!

So, controlling for the FDR really just requires scrutinizing the positive (aka significant) p values we get back. The most common way to correct p values to control for the FDR is by using the Benjamini & Hochberg method, which is a procedure with three steps.

First, you order your p values in ascending order (smallest to largest). You then multiply each one by m/k, where k is the p value’s rank (the smallest p value is rank 1, and, if you had 37 p values, then the largest one would be rank 37) and m is the number of tests you did. Let’s pause to consider what happens to m/k as we move from our smallest to our largest p value.

37/1:37
##  [1] 37.000000 18.500000 12.333333  9.250000  7.400000  6.166667  5.285714
##  [8]  4.625000  4.111111  3.700000  3.363636  3.083333  2.846154  2.642857
## [15]  2.466667  2.312500  2.176471  2.055556  1.947368  1.850000  1.761905
## [22]  1.681818  1.608696  1.541667  1.480000  1.423077  1.370370  1.321429
## [29]  1.275862  1.233333  1.193548  1.156250  1.121212  1.088235  1.057143
## [36]  1.027778  1.000000

As you can see, m/k gets smaller kind of quickly, so it’s possible your resulting p values will not be sorted in strictly ascending order after you multiply them by m/k. If this happens, both the higher- and lower-ranked adjusted p values will get set to the lower value. For example, if two uncorrected neighboring p values are 0.02 and 0.1 and, when adjusted, they become 0.15 and 0.14, they will both become 0.14 instead so the order stays strictly ascending without having to reorder everybody.

So, we’ll be multiplying our low p values by large numbers and our high p values by smaller numbers. If even our lowest p values aren’t really all that low (and thus our null hypotheses really don’t look too implausible), then this will likely push many of those p values above our alpha.

Anyway, let’s see what this does to our p values from our experiment.

#Multiply our sorted p values by our m/k
ps_fdr = sort(p_vals) * (37/1:37)

#Replace any values > 1 with 1
ps_fdr[ps_fdr > 1] = 1

#Make sure that no later p value is < than any earlier one, so they stay in strictly ascending order.
for(i in 1:36) {
  #What's the smallest p value at or ahead of the current entry?
  min_p = min(ps_fdr[i:37])
  
  #If the current value is larger than that one, make it equal to that one instead.
  if(ps_fdr[i] > min_p) {
    ps_fdr[i] = min_p
  }
}

#See the results
ps_fdr
##  [1] 0.2956216 0.2956216 0.4989924 0.4989924 0.4989924 0.5014855 0.5182277
##  [8] 0.5182277 0.5694022 0.6859010 0.6902728 0.7052044 0.7860682 0.7860682
## [15] 0.7860682 0.7860682 0.8162826 0.8972316 0.9360155 0.9360155 0.9664455
## [22] 0.9664455 0.9664455 0.9664455 0.9776746 0.9776746 0.9776746 0.9776746
## [29] 0.9776746 0.9776746 0.9776746 0.9776746 0.9776746 0.9776746 0.9776746
## [36] 0.9776746 0.9826954

Hopefully, two things are apparent from these p values. First, they are still much higher than our original ones, though much more so at the low end of the spectrum than at the high end. Second, they are still much lower than those we produced using the Bonferroni correction. As such, they represent a “compromise” between our uncorrected and our Bonferroni-corrected p values, and, importantly, they also suggest we shouldn’t reject any null hypotheses here.

As before, we didn’t actually need to do this math “by hand,” as the p.adjust() function can handle this correction also:

p.adjust(sort(p_vals), method = "fdr")
##  [1] 0.2956216 0.2956216 0.4989924 0.4989924 0.4989924 0.5014855 0.5182277
##  [8] 0.5182277 0.5694022 0.6859010 0.6902728 0.7052044 0.7860682 0.7860682
## [15] 0.7860682 0.7860682 0.8162826 0.8972316 0.9360155 0.9360155 0.9664455
## [22] 0.9664455 0.9664455 0.9664455 0.9776746 0.9776746 0.9776746 0.9776746
## [29] 0.9776746 0.9776746 0.9776746 0.9776746 0.9776746 0.9776746 0.9776746
## [36] 0.9776746 0.9826954

Ending On A Philosophical Note

So, we’ve seen why one would correct one’s p values, and we’ve seen how one would correct one’s p values. The last thing I think we ought to discuss is: should we correct our p values?

There is actually not a consensus on this question. For example, there is no agreed-upon threshold number of tests above which one “ought” to correct their p values. There is also not a single agreed-upon method for doing so. For example, there are seven different methods for p value correction built into the p.adjust() function, and there are others out there beyond those.

Plus, there is not a consensus about whether correcting for the FWER or the FDR is “better.” I’ll add that while there seems to be a consensus that correcting p values is preferable to correcting one’s alphas, I’d wonder if that consensus is justified, for reasons I raised earlier in this post.

P values make sense whether you can make sense of them or not!

I’ll also admit that we end up in a very strange place when we push the logic of p value correction to its extremes. If I have an elevated chance of producing “false positives” by reporting many tests in one paper, what about across all the papers I publish in my career? What about all the studies published in one edition of a journal? Or all the studies published ever? The argument goes that it’s arbitrary to draw the line for p value corrections at the paper when the problem extends far beyond that.

My counterargument would be that I have a limited ability to influence how many tests others perform or how they perform them. I also don’t know how many tests I’ll perform in future studies. However, I do have influence over how many tests I am doing in my current study, and also in how I’ve done them. So, to me, correcting at the level of the study feels “right.” Your perspective might well be different.

So, while I can’t answer the question of “should you adjust your p values?” objectively, I can give you some considerations that may guide you to the right answer for your context:

If you ever see an ecologist crying, this is probably why.

I’ll close by just re-iterating a point I made earlier: Except for very exploratory, low-stakes, foundational research, a “kitchen sink” approach is often not the best one, at least when you’re conducting hypothesis-driven work. What I mean is: Don’t gather a bazillion variables and combine them into a bazillion tests and sub-tests just to see “what sticks.”

Instead, ask a precise question, design a precise study, gather data selectively, and build a tight and cohesive analytical model, and you’ll likely never have to worry too much about how different your conclusions would be if you had corrected your p values!

And if you’re ever asked to correct your p values, you probably ought to do it, even if you report your uncorrected p values also. More context for the reader is rarely worse than less!

More Resources

Here’s a list of some resources I scanned prior to crafting this article, in case you’d like to do some further reading!